vault: backport instrumentation and observation blob parallelization to 2.39.2#21762
prashantkumar1982 merged 15 commits into release/2.39.2 from
Conversation
Instrument all public KVStore methods with a histogram metric (platform_vault_plugin_kv_operation_duration_seconds) to help diagnose high P95 Observation latencies in production. Also adds a max-batch Observation test exercising 10 GetSecrets requests with 10 encryption keys each plus 20 pending queue items.
✅ No conflicts with other open PRs targeting
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc7813b203
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
	batchSize.DefaultValue = int(configProto.BatchSize)
}
- cfg.MaxBatchSize, err = limits.MakeUpperBoundLimiter(r.limitsFactory, batchSize)
+ cfg.MaxBatchSize, err = limits.MakeUpperBoundLimiter(r.limitsFactory, cresettings.Default.VaultPluginBatchSizeLimit)
```
Apply offchain BatchSize when creating MaxBatchSize limiter
NewReportingPlugin still parses configProto, but MaxBatchSize is now always built from cresettings.Default.VaultPluginBatchSizeLimit, never from configProto.BatchSize. This silently ignores operator-provided batch-size overrides and can change the runtime request and pending-queue limits from what the DON config specifies, altering queue truncation and validation behavior in production without surfacing any configuration error.
💡 Codex Review
Reviewed commit: fa7c47392f
```go
for i, payload := range blobPayloads {
	g.Go(func() error {
		blobHandle, ierr2 := blobBroadcastFetcher.BroadcastBlob(broadcastCtx, payload, ocr3_1types.BlobExpirationHintSequenceNumber{SeqNr: seqNr + 2})
```
Fail fast before launching all blob broadcasts
This loop starts a goroutine for every pending payload before any error is observed; errors are handled only at g.Wait(). In practice, if one BroadcastBlob call fails while others succeed, Observation still returns an error (so the round retries), but many uploads from the failed round may already have been persisted. On blob backends that count each upload against unexpired blob quotas, this regression can quickly consume per-oracle blob capacity and lead to repeated round failures.
Summary
This backports the Vault instrumentation and Observation blob broadcast parallelization changes to release/2.39.2.
What changed
Observation()
Testing